Internet Info 1997 December


Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id PAA00902 for urn-ietf-out; Fri, 25 Oct 1996 15:32:00 -0400
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id PAA00897
    for <urn-ietf@services.bunyip.com>; Fri, 25 Oct 1996 15:31:58 -0400
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA07959
    (mail destined for urn-ietf@services.bunyip.com); Fri, 25 Oct 96 15:31:53 -0400
Received: from ifi.unizh.ch by josef.ifi.unizh.ch
    id <01036-0@josef.ifi.unizh.ch>; Fri, 25 Oct 1996 21:31:31 +0100
Subject: Re: [URN] URN Syntax thoughts
To: masinter@parc.xerox.com
Date: Fri, 25 Oct 1996 21:31:30 +0100 (MET)
Cc: jayhawk@ds.internic.net, urn-ietf@bunyip.com
In-Reply-To: <96Oct25.090902pdt."2759"@golden.parc.xerox.com>
    from "Larry Masinter" at Oct 25, 96 09:09:02 am
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 3112
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..739:25.09.96.20.31.46"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Larry Masinter wrote:

>> Is it so much different to back away from the 10646/UTF-8 statement in
>> the syntax document and say "As part of the registration procedure, the
>> character set/encoding used by a namespace shall be documented, or some
>> such words to that effect."
>
>This is completely inadequate: the user and client need to be able to
>translate the thing that the user sees on the cereal boxtop (sequence
>of glyphs) into what gets transmitted over the network (sequence of
>octets) in an unambiguous way. The translation method can't vary
>depending on the first part of what's translated.

I definitely agree that we should not let things float too much.
In my view, it should come down to the following statements:

 - If you create something new, and deal with characters, use UTF-8.
 - If you have something old, try to move towards UTF-8.
 - If you don't do UTF-8, be aware that support is not guaranteed
   (i.e. you will have to print %HH on your cardboard box,...).
   [So this means that the translation method doesn't vary.
   It's just not there in some cases :-).]
 - If you use an octet sequence that is not UTF-8, we won't consider
   that an outright error.

I repeat the main reasons why I think this amount of flexibility
is necessary.

First, there are cases where you don't know what encoding is used.
You could treat that like the data URL, but in some cases this would
mean that you would have to split functionality that now comes
together. FTP is a typical example. It is very difficult to specify
that servers that don't know the encodings of their filenames should
use a data-like scheme, and those that know the encoding can use
UTF-8. It is much more encouraging to say: If you know, convert to
UTF-8; if you don't know, use raw octets. This gives a very smooth
transition.

The other reason for flexibility is political. Some people are, for
sometimes rather strange reasons, heavily against Unicode. It's much
easier to deal with them if we don't "force everybody to use Unicode".
Still, we want things to go in that direction, to simplify things,
so just saying "the namespaces take care" is of no use.

A minor, but maybe serious, problem with this approach is the
following: Assume that at some point (either at entry or at some
server) some general UTF-8 URN normalization/canonicalization is done.
It's easy to say that no such thing should be done on octet sequences
that are not legal UTF-8. But there is a slight possibility that some
arbitrary octet sequence is UTF-8 (and not pure ASCII).
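As a rough illustration of the rules above, a minimal Python sketch
might look like the following. All function names are hypothetical,
and NFC is only assumed as the normalization form, since nothing in
this message names one. It shows "if you know, convert to UTF-8; if
you don't know, use raw octets", the %HH form, and a normalization
step that leaves anything that is not legal UTF-8 untouched.

```python
import unicodedata
from typing import Optional
from urllib.parse import quote


def octets_for_name(raw: bytes, source_encoding: Optional[str] = None) -> bytes:
    """If the encoding is known, convert to UTF-8; otherwise keep raw octets."""
    if source_encoding is None:
        return raw
    return raw.decode(source_encoding).encode("utf-8")


def percent_escape(octets: bytes) -> str:
    """Render octets in the %HH form that can be printed anywhere."""
    return quote(octets, safe="")


def normalize_if_utf8(octets: bytes) -> bytes:
    """Normalize only legal UTF-8; NFC is an assumption of this sketch."""
    try:
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets  # not legal UTF-8: no normalization is attempted
    return unicodedata.normalize("NFC", text).encode("utf-8")


# An FTP-like server that knows its filenames are Latin-1:
known = octets_for_name(b"\xe0.txt", "latin-1")
# A server that does not know the encoding:
unknown = octets_for_name(b"\xe0.txt")
print(percent_escape(known), percent_escape(unknown))
```

Note that normalize_if_utf8 cannot tell genuine UTF-8 from arbitrary
octets that merely happen to decode as UTF-8, which is exactly the
problem described above.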
Here are the probabilities, derived mostly by simulation:

Length of sequence    Approximate probability
         1                  0
         2                  1/32
         3                  1/28
         4                  1/35
         5                  1/48
         6                  1/70
         7                  1/107
         8                  1/175
         9                  1/287
        10                  1/465
        15                  1/6450
        20                  1/100000

Of course, the probability that an octet sequence would actually be
affected by normalization is much, much lower, esp. if this
normalization is restricted to cases of the A-grave type. It might
be possible to construct a RISK case out of this :-(.

Still, I think a little bit of flexibility for backwards compatibility
is desirable.

Regards,    Martin.
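The simulation behind the table is not included in the message. One
way such figures might be estimated is sketched below in Python,
assuming octets are drawn uniformly from all 256 values (an assumption;
the message does not state the distribution). The function names are
hypothetical. Python's decoder follows the current, stricter UTF-8
definition, which rejects overlong forms and surrogates, so its results
can be expected to come out somewhat lower than the 1996 figures.

```python
import random


def is_non_ascii_utf8(octets: bytes) -> bool:
    """True if the octets are legal UTF-8 and not pure ASCII."""
    if octets.isascii():
        return False
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def estimate_probability(length: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate for uniformly random octet sequences (assumed model)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    hits = 0
    for _ in range(trials):
        sample = bytes(rng.getrandbits(8) for _ in range(length))
        if is_non_ascii_utf8(sample):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        p = estimate_probability(n)
        print(n, p)
```

For length 2 the exact value under these assumptions is
30*64/65536, about 1/34, because the strict decoder accepts only the
30 lead octets C2 to DF. The 1/32 in the table is consistent with
counting all 32 lead octets C0 to DF, though that reading is an
inference and not something the message states.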